In this blog post we will investigate the predictive power of song lyrics and engineered audio features for genre classification. We will train three models: one on just the lyrics, one on just the features, and one on both, and then compare the results to see whether combining the two helps. For the lyrics we will tokenize with BERT's tokenizer and feed the token ids through a learned embedding layer into a neural network; the engineered features will go through a similar network. The combined model will concatenate the outputs of the lyrics and feature branches before passing them through a final couple of layers. We will end by examining the learned embeddings to see if there are any interesting patterns.
Data
The data we will be using is from Spotify and contains information about songs, their lyrics, and their features. As we can see below, the data has a lot of information about the songs including the genre which we will be using as our target variable.
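To make the structure concrete, here is a tiny hand-built stand-in for the frame (an illustration only; the real dataset is loaded from a file and has many more columns than the handful shown here):

```python
import pandas as pd

# A tiny hand-built stand-in for the Spotify dataset, just to show
# the kinds of columns relied on below (the real frame is much larger).
df = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "genre": ["pop", "rock", "pop"],
    "release_date": [1995, 2003, 2018],
    "danceability": [0.52, 0.43, 0.71],
    "lyrics": ["la la la", "rock on", "dance tonight"],
})
print(df.head())
```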
Now that we have the data, one interesting thing we can do is explore some of the engineered features and genres. An interesting question to ask is whether pop music has gotten more danceable over the years. Since we have release date information as well as a danceability measure, we can filter to just the pop songs, group by year, and calculate the mean danceability.
pop_data = df[df['genre'] == 'pop']
pop_danceability = pop_data.groupby('release_date')['danceability'].mean().reset_index()

plt.figure(figsize=(10, 6))
plt.plot(pop_danceability['release_date'], pop_danceability['danceability'], marker='o', label='Danceability')
plt.title('Has Pop Music Gotten More Danceable Over Time?', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Danceability', fontsize=12)
plt.grid(True)
plt.legend()
plt.show()
As we can see, it appears that pop music has gotten more danceable over the years. This is an interesting finding and shows that we can use the engineered features to gain insights into the data.
Preprocessing
As the genre is a string, we will need to convert it to a number for the model to be able to use it. We will do this by using the LabelEncoder from sklearn. This will allow us to map the genres to numbers which will serve as our categories for the target variable. We output the count of each genre along with its mapping to see the imbalance in the data.
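As a sketch of this step (run here on a toy genre column; the real call operates on the full dataset), LabelEncoder assigns each genre an integer code in alphabetical order:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy genre column standing in for the real one.
df = pd.DataFrame({"genre": ["pop", "rock", "pop", "jazz", "pop"]})

# Map each genre string to an integer code (alphabetical order).
le = LabelEncoder()
df["genre_numeric"] = le.fit_transform(df["genre"])

# Show the counts and the genre-to-number mapping.
print(df["genre"].value_counts())
print(dict(zip(le.classes_, le.transform(le.classes_))))
```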
As we can see, the data is relatively imbalanced, with pop as the most common genre. We will use this distribution to compute our baseline accuracy: the accuracy achieved by always predicting the most common genre. Doing so, we see that the baseline accuracy is 0.248. This is important to keep in mind when evaluating our models, since a model that simply predicts pop every time would already score 0.248, so we expect our models to do better than this.
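The baseline is just the share of the most common class. A minimal sketch, using a toy series in place of the real genre column:

```python
import pandas as pd

# Baseline accuracy = fraction of rows in the most common class.
# Toy data here; on the full dataset this comes out to 0.248.
genres = pd.Series(["pop", "pop", "rock", "jazz"])
baseline = genres.value_counts(normalize=True).max()
print(round(baseline, 3))
```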
Now that we better understand the data, the next step is to turn the lyrics into token ids. We will use the BertTokenizer from the Hugging Face transformers library, loaded from the google-bert/bert-base-uncased checkpoint. To do this we first have to preprocess the lyrics by padding (or truncating) them to the same length, which adds padding tokens to the end of shorter sequences so they are all the same size and can be batched in the model. After splitting the data into training and validation sets we will preprocess the lyrics so that the token ids are ready to be used in the models.
max_len = 512
df["lyrics"] = df["lyrics"].apply(lambda x: " ".join(x.split()[:max_len]))
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

def pad(l, max_len):
    # Pad short sequences with 0s; truncate long ones.
    if len(l) <= max_len:
        to_add = max_len - len(l)
        return l + [0] * to_add
    else:
        return l[:max_len]

def preprocess(df, tokenizer, max_len):
    X = tokenizer(list(df["lyrics"]))["input_ids"]
    X = [pad(t, max_len) for t in X]
    y = list(df["genre_numeric"])
    return X, y

class TextDataFromDF(Dataset):
    def __init__(self, df):
        self.X, self.y = preprocess(df, tokenizer, max_len)

    def __getitem__(self, ix):
        return self.X[ix], self.y[ix]

    def __len__(self):
        return len(self.y)

train_lyrics = TextDataFromDF(train_df)
val_lyrics = TextDataFromDF(val_df)
Model 1
In order to create the models we first have to build data loaders for the training and validation sets, using DataLoader from torch. The same loaders will serve all three models: each batch contains the tokenized lyrics, the engineered features, and the target genre. Finally, we define the first model, a simple feed-forward neural network with three fully connected layers after the embedding.
def collate(data):
    # Each item is a (lyrics, genre, features) triple; batch each part.
    lyrics = torch.tensor([d[0] for d in data])
    features = torch.tensor([d[2] for d in data])
    y = torch.tensor([d[1] for d in data])
    return lyrics, features, y

engineered_features = ['dating', 'violence', 'world/life', 'night/time',
                       'shake the audience', 'family/gospel', 'romantic', 'communication',
                       'obscene', 'music', 'movement/places', 'light/visual perceptions',
                       'family/spiritual', 'like/girls', 'sadness', 'feelings', 'danceability',
                       'loudness', 'acousticness', 'instrumentalness', 'valence', 'energy']

train_features = train_df[engineered_features].values.tolist()
val_features = val_df[engineered_features].values.tolist()
train_data = [(train_lyrics[i][0], train_lyrics[i][1], train_features[i]) for i in range(len(train_lyrics))]
val_data = [(val_lyrics[i][0], val_lyrics[i][1], val_features[i]) for i in range(len(val_lyrics))]
train_loader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate)
val_loader = DataLoader(val_data, batch_size=8, shuffle=False, collate_fn=collate)

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, max_len, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size + 1, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, num_class)
        self.dropout = nn.Dropout(0.3)
        self.relu = nn.ReLU()

    def forward(self, lyrics, features):
        x = self.embedding(lyrics)
        x = torch.mean(x, dim=1)  # mean-pool over the token dimension
        x = torch.flatten(x, 1)
        x = self.dropout(x)
        x = self.relu(x)
        x = self.fc1(x)
        x = self.dropout(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.dropout(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

vocab_size = len(tokenizer.vocab)
embedding_dim = 25
num_class = max(df["genre_numeric"]) + 1
lyrics_model = TextClassificationModel(vocab_size, embedding_dim, max_len, num_class).to(device)
With the model defined we can now train it. We define a training function that iterates over the data loader and passes the lyrics through the model, using the Adam optimizer and the CrossEntropyLoss loss function. We will train the model for 20 epochs, printing the loss and the time taken for each epoch, and then evaluate the model on the validation set and print the accuracy.
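The training and evaluation functions described above can be sketched roughly as follows (a sketch, not the post's exact code: the learning rate and loop details are assumptions, and batches are assumed to be the (lyrics, features, y) tuples produced by the collate function):

```python
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset  # used to build loaders

def train(model, loader, epochs=20, lr=1e-3, device="cpu"):
    # Adam + cross-entropy, printing per-epoch loss and wall time.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        start, total = time.time(), 0.0
        for lyrics, features, y in loader:
            lyrics, features, y = lyrics.to(device), features.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(lyrics, features), y)
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch}: loss {total / len(loader):.4f}, {time.time() - start:.1f}s")

def accuracy(model, loader, device="cpu"):
    # Fraction of examples whose argmax prediction matches the label.
    model.eval()
    correct = n = 0
    with torch.no_grad():
        for lyrics, features, y in loader:
            preds = model(lyrics.to(device), features.to(device)).argmax(dim=1)
            correct += (preds == y.to(device)).sum().item()
            n += len(y)
    return correct / n
```

In the post this would be invoked as `train(lyrics_model, train_loader)` followed by `accuracy(lyrics_model, val_loader)`.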
As we can see, the loss improves over time but the accuracy on the validation set is not very high, which suggests the model is overfitting to the training set. That said, it still manages a validation accuracy of 0.281, better than the 0.248 baseline. This shows that the model is learning something from the lyrics and is able to predict the genre of a song to a marginal extent.
Model 2
The second model will be similar to the first but will use the engineered features instead of the lyrics. We will use similar data loaders and the same training function as before. This model has 3 hidden layers and uses the ReLU activation function. We will train the model for 20 epochs and print the loss and the time taken for each epoch. Afterward we will evaluate the model on the validation set.
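A sketch of the features-only model (the layer widths here are assumptions, since the post only fixes the number of hidden layers; lyrics are accepted and ignored so the same loaders and training loop work unchanged):

```python
import torch
from torch import nn

class FeaturesClassificationModel(nn.Module):
    # Feed-forward net over the engineered features: 3 hidden layers
    # with ReLU and dropout. Widths are illustrative assumptions.
    def __init__(self, num_features, num_class):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, num_class),
        )

    def forward(self, lyrics, features):
        # lyrics are ignored; the signature matches the shared loop.
        return self.net(features.float())
```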
As we can see, the model achieves a validation accuracy of 0.378, which is better than the first model. This suggests that the engineered features are more informative than the lyrics. Here the validation accuracy also diverges less from the training accuracy than in the first model, which indicates the model is not overfitting as much.
Model 3
The third model will be a combination of the first two models. We will concatenate the results from the lyrics and the features before passing them through a final couple of layers. The lyrics model has 2 hidden layers and the features model has 3 hidden layers. Both use the ReLU activation function. We will train the model for 20 epochs and print the loss and the time taken for each epoch. Afterward we will evaluate the model on the validation set.
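A rough sketch of the combined architecture (branch widths are illustrative assumptions; only the overall structure of two lyric layers, three feature layers, concatenation, and a final couple of layers follows the text):

```python
import torch
from torch import nn

class CombinedModel(nn.Module):
    # Lyrics branch (mean-pooled embeddings, 2 hidden layers) and
    # features branch (3 hidden layers) concatenated before a final
    # pair of layers. Widths are assumptions.
    def __init__(self, vocab_size, embedding_dim, num_features, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size + 1, embedding_dim)
        self.lyrics_branch = nn.Sequential(
            nn.Linear(embedding_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.features_branch = nn.Sequential(
            nn.Linear(num_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 + 32, 64), nn.ReLU(),
            nn.Linear(64, num_class),
        )

    def forward(self, lyrics, features):
        x = self.embedding(lyrics).mean(dim=1)  # mean-pool token embeddings
        x = self.lyrics_branch(x)
        f = self.features_branch(features.float())
        return self.head(torch.cat([x, f], dim=1))
```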
As we can see, the model improves slightly over the epochs, though training takes considerably longer. The results on the validation set beat the second model by 1.5 percentage points, which suggests that combining the features with the lyrics performs better than either on its own. With more tinkering to optimize the layers we could likely have done better still, as the final accuracy of 0.393 is not especially high. That said, it comfortably beats the baseline, so it is a step in the right direction.
Examining the Embeddings
The final step is to extract the learned embeddings from the model and examine them for patterns. We will use PCA to reduce the dimensionality of the embeddings to two dimensions and Plotly for interactive plotting. While we are able to plot the embeddings, we do not see many clear patterns: the words are fairly evenly distributed, with no obvious clusters.
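The PCA step can be sketched as follows (shown here on a small randomly initialised embedding layer; in the post the weights would come from the trained model, and the resulting 2-D coordinates are then handed to Plotly, e.g. plotly.express.scatter, for the interactive plot):

```python
import torch
from torch import nn
from sklearn.decomposition import PCA

# Stand-in for the trained model's embedding layer.
embedding = nn.Embedding(1000, 25)
weights = embedding.weight.detach().numpy()

# Project the 25-dimensional token embeddings down to 2-D.
pca = PCA(n_components=2)
coords = pca.fit_transform(weights)
print(coords.shape)  # one 2-D point per vocabulary token
```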
In this blog post we examined the feasibility of using song lyrics and engineered features to predict the genre of a song. We trained three models, one on just the lyrics, one on just the features, and one on both. We found that the model trained on the features performed better than the one trained on the lyrics. The combined model performed slightly better than the features model but not significantly so. We also examined the embeddings from the models and did not find any clear patterns. Overall, this shows that while it is possible to use song lyrics and engineered features to predict the genre of a song, there is still a lot of room for improvement in our methods.